Add force resume support for cross-cluster replication #1685
mohit10011999 wants to merge 19 commits into
Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
This reverts commit 0e4b126. Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
Force-pushed from c20e107 to fb17424.
- Removed reimplemented block removal/addition (reuses UpdateIndexBlockAction, same as stop action)
- Removed reimplemented index deletion (reuses same pattern as IndexReplicationTask.cancelRestore)
- Removed reimplemented lease cleanup (reuses existing RemoteClusterRetentionLeaseHelper methods)
- Removed ReplicationMetadataManager dependency from coordinator (not needed)
- Simplified getLeaderGlobalCheckpoint using firstOrNull (same API as RemoteClusterRepository)
- Only genuinely new logic retained: pre-restore lease acquisition at leaderCheckpoint+1

Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
Force-pushed from a8e76d6 to 5017605.
Persistent review updated to latest commit 5017605
Regarding #1685 (comment). Regarding the Race Condition comment: Incomplete Cleanup Logic Error Null seqNoStats
Force-pushed from 5017605 to a23f7a8.
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit a23f7a8.
Persistent review updated to latest commit a23f7a8
… boundary Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
#1685 (comment): Replaced .all() with .clear() followed by .translog(true). This tells the leader to compute only translog-related stats, dramatically reducing the payload sent across the cluster boundary.
Persistent review updated to latest commit fdf3669
Description
When retention leases expire on the leader cluster (e.g., after a prolonged pause or a leader translog rollover), the normal resume API fails because the follower can no longer catch up from where it left off. This PR adds a force_resume option that triggers a snapshot-based bootstrap to recover replication without requiring a full stop-and-restart cycle.
Solution
Added a force_resume=true parameter to the resume replication API.
When force_resume=true and retention leases are missing, the ForceResumeCoordinator orchestrates:
1. Remove index block - unblocks the follower index (same as stop action)
2. Acquire retention leases at leaderGlobalCheckpoint + 1 per shard - prevents the race condition where the leader's translog is purged during the async snapshot restore
3. Delete follower index - triggers the IndexReplicationTask state machine to re-bootstrap from snapshot via setupAndStartRestore()
On failure at any step, cleanup logic removes partially acquired leases and re-adds the index block
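
The ordering and cleanup described above can be sketched as a small in-memory simulation. This is an illustration only, not the plugin's code: `FakeFollower`, `force_resume`, and `fail_on_shard` are hypothetical names, and real block, lease, and index operations are replaced by plain attribute updates.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeFollower:
    """Hypothetical stand-in for follower/leader state (not a plugin class)."""
    blocked: bool = True                 # follower index block is in place
    index_exists: bool = True            # follower index has not been deleted
    leases: Dict[int, int] = field(default_factory=dict)  # shard -> retained seq no
    fail_on_shard: Optional[int] = None  # simulate a lease failure on this shard


def force_resume(state: FakeFollower, leader_checkpoints: Dict[int, int]) -> bool:
    """Run the three steps in order; undo partial work if any step fails."""
    acquired = []
    block_removed = False
    try:
        # Step 1: remove the index block.
        state.blocked = False
        block_removed = True
        # Step 2: acquire a lease at leaderGlobalCheckpoint + 1 for each shard.
        for shard, checkpoint in leader_checkpoints.items():
            if shard == state.fail_on_shard:
                raise RuntimeError(f"lease acquisition failed for shard {shard}")
            state.leases[shard] = checkpoint + 1
            acquired.append(shard)
        # Step 3: delete the follower index so a snapshot bootstrap can start.
        state.index_exists = False
        return True
    except RuntimeError:
        # Cleanup: drop partially acquired leases and re-add the index block.
        for shard in acquired:
            state.leases.pop(shard, None)
        if block_removed:
            state.blocked = True
        return False
```

In this sketch the leases are taken before the index is deleted, which mirrors the stated goal of holding the leader's translog in place while the asynchronous restore runs.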
Backward Compatibility
The force_resume field defaults to false, so existing clients calling _resume with an empty body or without the field see no behavior change
Wire serialization is additive (new boolean field appended)
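
As a rough illustration of the default behavior, the following sketch parses a resume request body and treats a missing field or an empty body as force_resume=false. The function name `parse_resume_request` is hypothetical and the JSON body shape is an assumption. It does not model the plugin's actual request parser or its wire format.

```python
import json


def parse_resume_request(body: str) -> dict:
    """Hypothetical parser: force_resume is False unless explicitly set to true."""
    payload = json.loads(body) if body.strip() else {}
    # Only a literal JSON true enables the flag; absence means the old behavior.
    return {"force_resume": payload.get("force_resume", False) is True}
```

Under these assumptions, an existing client that sends an empty body or `{}` gets the same result as before, and only a body containing `"force_resume": true` opts in.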
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
- Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.